As single-cell technology advances, the number of multi-condition multi-sample datasets increases. This workshop will discuss the challenges and analytical focus associated with disease outcome prediction using single cell data, which is one of the analytics involved in such datasets. The workshop’s overall learning goal is to investigate one possible approach to this task. We will also talk about general analytic strategies and the critical thinking questions that arise in the workflow.
Structure of the 2-hour workshop:
| Activity | Time |
|---|---|
| Introduction + loading data | 20m |
| Extracting features from single cell data | 40m |
| Break | 10m |
| Perform disease outcome classification | 40m |
| Concluding remarks | 10m |
The rise of single-cell or near single-cell resolution omics technologies (e.g. spatial transcriptomics) has enabled the discovery of cell- and cell type specific knowledge and have transformed our understanding of biological systems. Because of the high-dimensionality and complexity, over 1000 tools have been developed to extract meaningful information from the high feature dimensions and uncover biological insights. For example, our previous workshop focuses on characterising the identity the state of cells and the relationship between cells along a trajectory.
While these tools enable characterisation of individual cells, there is a lack of tools that characterise individual samples as a whole based on their cellular properties and investigate how these cellular properties are driving disease outcomes. With the recent surge of multi-condition and multi-sample single-cell studies, the question becomes how do we represent cellular properties at the sample (e.g. individual patient) level for linking such information with the disease outcome and performing downstream analysis such as disease outcome prediction.
In this workshop, we will demonstrate our approach for generating a molecular representation for individual samples and using the representation for a downstream application of disease outcome classification.
First, load all the libraries we will be using in this workshop.
library(ggplot2)
library(plotly)
library(ggthemes)
theme_set(theme_bw())
library(Seurat)
library(dplyr)
library(devtools)
library(RCurl)
library(UpSetR)
library(scFeatures)
library(ClassifyR)
library(limma)
set.seed(2022)
We will use a single-cell RNA-sequencing (scRNA-seq) data of COVID-19 patients for this workshop. The dataset is taken from Schulte-Schrepping et al. 2020. We have subsampled 12 patient samples from this dataset, with the condition being mild/moderate and severe/critical. The original data can be accessed from the European Genome-phenome Archive (EGA) with the accession number EGAS00001004571.
data <- readRDS("~/toy_data/schulte_12patients.rds")
We can visualise the data using dimensionality reduction approaches and colour the individual cells by the severity.
data <- NormalizeData( data)
data <- FindVariableFeatures(data )
data <- ScaleData(data, features = rownames(data) )
data <- RunPCA(data)
data <- FindNeighbors(data, dims = 1:10)
data <- FindClusters(data, resolution = 0.5)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 11679
#> Number of edges: 394547
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.9331
#> Number of communities: 19
#> Elapsed time: 2 seconds
data <- RunUMAP(data , dims = 1:10)
DimPlot(data, reduction = "umap", group.by = "meta_severity")
Interpretation:
Are the cells from mild/moderate and severe/critical patients easy or difficult to distinguish?
In this workshop, we use scFeatures
to generate molecular representation for each patient. The molecular
representation is interpretable and hence facilitates downstream
analysis of the patient. Overall, scFeatures generates features across
six categories representing different molecular views of cellular
characteristics. These include:
- i) cell type proportions
- ii) cell type specific gene expressions
- iii) cell type specific pathway expressions
- iv) cell type specific cell-cell interaction (CCI) scores
- v) overall aggregated gene expressions
- vi) spatial metrics
The different types of features constructed enable a more comprehensive
multi-view understanding of each patient from a matrix of genes x
cells.

Inspect the number of cells in each patient sample, as well as the number of cells in each cell type.
Discussion:
Is there any sample or cell types you should remove from the data?
scFeatures also has a built-in function process_data,
that removes “small samples”, in which the sample has less than 10 cells
for every cell type, as well as “small cell types”, in which the cell
type has less than 10 cells for every sample.
table(data$sample)
#>
#> BN-18_FreshEryLysis_12.05.2020-FreshEryLysis
#> 1000
#> C19-CB-0002_d13-Fresh
#> 1000
#> C19-CB-0002_d8-Frozen
#> 1000
#> C19-CB-0003_d13-Fresh
#> 1000
#> C19-CB-0003_d18-Frozen
#> 1000
#> C19-CB-0008_d13-Fresh
#> 1000
#> C19-CB-0009_d16-Fresh
#> 1000
#> C19-CB-0009_d9-Fresh
#> 938
#> C19-CB-0012_d9-Fresh
#> 741
#> C19-CB-0021_d18-Fresh
#> 1000
#> C19-CB-0198_d18-Fresh
#> 1000
#> C19-CB-0214_d7-Fresh
#> 1000
table(data$celltype)
#>
#> B CD14 Mono CD16 Mono CD4 T CD8 T DC
#> 654 2617 314 2162 725 128
#> DN gdT HSPC ILC intermediate MAIT
#> 14 156 19 1 232 156
#> MAST Neutrophil NK NKT Plasma Platelet
#> 8 2008 1189 811 204 134
#> RBC unassigned
#> 1 146
data <- process_data(data)
table(data$sample)
#>
#> BN-18_FreshEryLysis_12.05.2020-FreshEryLysis
#> 1000
#> C19-CB-0002_d13-Fresh
#> 1000
#> C19-CB-0002_d8-Frozen
#> 1000
#> C19-CB-0003_d13-Fresh
#> 1000
#> C19-CB-0003_d18-Frozen
#> 1000
#> C19-CB-0008_d13-Fresh
#> 1000
#> C19-CB-0009_d16-Fresh
#> 1000
#> C19-CB-0009_d9-Fresh
#> 938
#> C19-CB-0012_d9-Fresh
#> 741
#> C19-CB-0021_d18-Fresh
#> 1000
#> C19-CB-0198_d18-Fresh
#> 1000
#> C19-CB-0214_d7-Fresh
#> 1000
table(data$celltype)
#>
#> B CD14 Mono CD16 Mono CD4 T CD8 T DC
#> 654 2617 314 2162 725 128
#> DN gdT HSPC ILC intermediate MAIT
#> 14 156 19 1 232 156
#> MAST Neutrophil NK NKT Plasma Platelet
#> 8 2008 1189 811 204 134
#> RBC unassigned
#> 1 146
Discussion:
After running the process_data function, are there still
any patient samples or cell types that you should remove from the
data?
All the feature types can be generated in one line of code. This runs
the function using default settings for all parameters, for more
information, type ?scFeatures.
Given that this step may take up to 20 minutes, we have already
generated the result and saved it in the
intermediate_result folder. You could skip this step and
proceed to the next step for the purposes of this workshop.
# Here we label the samples using the severity, such as it is easier later on to retrieve the severity outcomes of the samples.
data$sample <- paste0(data$sample, "_cond_" , data$meta_severity)
scfeatures_result <- scFeatures(data, ncores = 12)
We have generated a total of 13 feature types and stored them in a
list. All generated feature types are stored in a matrix of samples by
features.
For example, the first list element contains the feature type
“proportion_raw”, which contains the cell type proportion features for
each patient sample. We could print out the first 5 columns and first 5
rows of the first element to see.
scfeatures_result <- readRDS("~/intermediate_result/scfeatures_result_schulte_12patients.rds")
# we have generated a total of 13 feature types
names(scfeatures_result)
#> [1] "proportion_raw" "proportion_logit" "proportion_ratio"
#> [4] "gene_mean_celltype" "gene_prop_celltype" "gene_cor_celltype"
#> [7] "pathway_gsva" "pathway_mean" "pathway_prop"
#> [10] "CCI" "gene_mean_bulk" "gene_prop_bulk"
#> [13] "gene_cor_bulk"
# each row is a sample, each column is a feature
scfeatures_result[[1]][1:5, 1:5]
#> B CD14 Mono CD16 Mono
#> C19-CB-0003_d13-Fresh_cond_Mild/Moderate 0.15200000 0.56700000 0.062
#> C19-CB-0002_d13-Fresh_cond_Mild/Moderate 0.23100000 0.27300000 0.032
#> C19-CB-0002_d8-Frozen_cond_Mild/Moderate 0.06700000 0.27800000 0.030
#> C19-CB-0003_d18-Frozen_cond_Mild/Moderate 0.03200000 0.57400000 0.134
#> C19-CB-0009_d9-Fresh_cond_Severe/Critical 0.01599147 0.09275053 0.000
#> CD4 T CD8 T
#> C19-CB-0003_d13-Fresh_cond_Mild/Moderate 0.024000 0.02400000
#> C19-CB-0002_d13-Fresh_cond_Mild/Moderate 0.060000 0.05200000
#> C19-CB-0002_d8-Frozen_cond_Mild/Moderate 0.046000 0.06200000
#> C19-CB-0003_d18-Frozen_cond_Mild/Moderate 0.017000 0.03600000
#> C19-CB-0009_d9-Fresh_cond_Severe/Critical 0.336887 0.05223881
Once the features are generated, you may wish to visually explore the features. For example, with cell type specific gene expression score, a volcano plot would provide more direct insight than a matrix of values.
data <- scfeatures_result$gene_mean_celltype
data <- t(data)
# pick CD14 Mono as an example cell type
data <- data[ grep("CD14", rownames(data)), ]
condition <- unlist( lapply( strsplit( colnames(data), "_cond_"), `[`, 2))
condition <- data.frame(condition = condition )
design <- model.matrix(~condition, data = condition)
fit <- lmFit(data, design)
fit <- eBayes(fit)
tT <- topTable(fit, n = Inf)
tT$gene <- rownames(tT)
p <- ggplot( tT , aes(logFC,-log10(P.Value) , text = gene ) )+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
xlab("log2 fold change") + ylab("-log10 p-value") + theme_minimal()
ggplotly(p)
To accommodate for this need, scFeatures contains a function
run_association_study_report that enables the user to
readily visualise and explore all generated features with one line of
code.
Note, because this function knits a HTML file, please paste the following command in the console and run it there.
# specify a folder to store the html report. Here we store it in the current working directory.
output_folder <- getwd()
run_association_study_report(scfeatures_result, output_folder )
Discussion:
Using the HTML, we can look at some of the critical thinking questions that a researcher would ask about the generated features. These questions are exploratory and there is no right or wrong answer.
Now that we have generated the patient representation, in this section we will examine a useful case study of using the representation to perform disease outcome classification.
In this workshop we use the classifyR package to build a classification model. It provides an implementation of a typical framework for classification, including a function that performs repeated cross-validation with one line of code.

We will build a classification model for each feature type.
# First clean the column names
for(i in 1:length(scfeatures_result)){
names(scfeatures_result[[i]]) <- gsub("\\s|[[:punct:]]", ".", names(scfeatures_result[[i]]))
}
# obtain the patient outcome, which is stored in the rownames of each matrix
outcome = scfeatures_result[[1]] %>% rownames %>% strsplit("_cond_") %>% sapply(function(x) x[2])
Recall in the previous section that we have stored the 13 feature types matrix in a list. Instead of retrieving each matrix from the list, classifyR can directly take a list of matrices as an input and run repeated cross-validation model on each matrix individually.
classifyr_result = crossValidate(scfeatures_result,
outcome,
classifier = "kNN",
nFolds = 2,
nRepeats = 5,
nCores = 5 )
To examine the classification model performance, we first need to specify a metric to calculate. Here, we calculate the balanced accuracy.
classifyr_result <- readRDS("~/intermediate_result/classifyr_result_knn.rds")
classifyr_result <- lapply(classifyr_result, function(x) calcCVperformance(x, performanceType = "Balanced Accuracy"))
Format the output and visualise the accuracy using boxplots.
accuracy <- data.frame( lapply(classifyr_result, performance) )
colnames(accuracy ) <- unlist( lapply ( strsplit( names(classifyr_result) , "\\." ), `[` , 1))
accuracy <- reshape2::melt(accuracy)
accuracy_knn_balanced_accuracy <- ggplot(accuracy, aes(x = variable, y = value)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + xlab("Feature types") + ylab("Balanced Accuracy")+ ylim(0,1)
accuracy_knn_balanced_accuracy
Interpretation:
Based on the classification performance, which feature type would you like to focus on at your next stage of analysis?
In classifyR, the default settings perform 5 repeats of 2-fold cross-validation. This means for each feature type, we get a total of 10 models. Therefore, to answer what features are important in the modelling, we first need to examine whether the rankings (or the importance) of each feature in each model (within each feature type) are the same.
To illustrate an example, we use the feature type “gene_mean_celltype” and look at the overlap between the top 20 ranked features in the 10 models.
# look at the list of models
names(classifyr_result)
#> [1] "proportion_raw.kNN.t-test" "proportion_logit.kNN.t-test"
#> [3] "proportion_ratio.kNN.t-test" "gene_mean_celltype.kNN.t-test"
#> [5] "gene_prop_celltype.kNN.t-test" "gene_cor_celltype.kNN.t-test"
#> [7] "pathway_gsva.kNN.t-test" "pathway_mean.kNN.t-test"
#> [9] "pathway_prop.kNN.t-test" "CCI.kNN.t-test"
#> [11] "gene_mean_bulk.kNN.t-test" "gene_prop_bulk.kNN.t-test"
#> [13] "gene_cor_bulk.kNN.t-test"
# pick the fourth model, which is on the feature type gene mean celltype
gene_mean_celltype <- classifyr_result[[4]]
data_list <- gene_mean_celltype@rankedFeatures
data_list <- lapply(data_list , function(df) {
df[, 2][1:20]
})
names(data_list) <- paste0("Model", 1:10)
# Create the UpSet plot
upset_plot <- upset(fromList(data_list), order.by = "freq", nsets = 10)
# Display the plot
upset_plot
Discussion:
Based on the upset plot, is the top ranked features consistent?
This also leads to the next question, how should we determine the final set of top features to focus on?
Here, we use the mean scores across the folds to calculate the final set of top features.
top_features <- lapply(1:10, function(x){
thismodel <- gene_mean_celltype@rankedFeatures[[x]]
thismodel$rank <- 1:nrow(thismodel)
thismodel <- thismodel[sort(rownames(thismodel)), ]
thismodel$model <- x
thismodel
})
top_features <- as.data.frame( do.call(rbind, top_features ) )
top_features <- top_features %>% group_by(feature) %>% dplyr::summarise(rank = mean(rank)) %>% arrange(rank)
print(head(top_features,10))
#> # A tibble: 10 × 2
#> feature rank
#> <chr> <dbl>
#> 1 Neutrophil..SNX3 128.
#> 2 gdT..SATB1 144.
#> 3 Neutrophil..ALOX5AP 165.
#> 4 Neutrophil..FCER1G 171.
#> 5 Neutrophil..SH3BGRL3 172.
#> 6 Neutrophil..PRR13 181.
#> 7 Neutrophil..CAST 183.
#> 8 Neutrophil..GRN 242.
#> 9 Neutrophil..CDC42 245.
#> 10 Neutrophil..CMTM6 272.
Discussion:
Further explore the ranks of the features. Which cell type(s) would you like to focus on at your next stage of analysis ?
Given we build prediction model for each of the feature type, we can examine whether the prediction from each feature type is consistent.
# looking at one repeat
data_list <- lapply( 1:13 , function(i) {
prediction <- classifyr_result[[i]]@predictions
prediction <- prediction[prediction$permutation == 1, ]$class
prediction <- as.data.frame(prediction)
})
data_list <- do.call( cbind, data_list)
colnames(data_list) <- names(classifyr_result)
# count the number of times each row is labeled as "Mild/Moderate"
counts <- apply(data_list == "Mild/Moderate", 1, sum)
# calculate the percentage of times each row is labeled as "Mild/Moderate"
percentages <- counts / ncol(data_list) * 100
consistency <- data.frame(patient = classifyr_result[[1]]@originalNames ,
percentages_mildmoderate = percentages )
consistency
#> patient
#> C19-CB-0002_d13-Fresh_cond_Mild/Moderate C19-CB-0003_d13-Fresh_cond_Mild/Moderate
#> C19-CB-0003_d18-Frozen_cond_Mild/Moderate C19-CB-0002_d13-Fresh_cond_Mild/Moderate
#> C19-CB-0214_d7-Fresh_cond_Mild/Moderate C19-CB-0002_d8-Frozen_cond_Mild/Moderate
#> BN-18_FreshEryLysis_12.05.2020-FreshEryLysis_cond_Severe/Critical C19-CB-0003_d18-Frozen_cond_Mild/Moderate
#> C19-CB-0198_d18-Fresh_cond_Severe/Critical C19-CB-0009_d9-Fresh_cond_Severe/Critical
#> C19-CB-0009_d9-Fresh_cond_Severe/Critical C19-CB-0012_d9-Fresh_cond_Severe/Critical
#> C19-CB-0002_d8-Frozen_cond_Mild/Moderate C19-CB-0008_d13-Fresh_cond_Severe/Critical
#> C19-CB-0003_d13-Fresh_cond_Mild/Moderate C19-CB-0009_d16-Fresh_cond_Severe/Critical
#> C19-CB-0009_d16-Fresh_cond_Severe/Critical C19-CB-0021_d18-Fresh_cond_Severe/Critical
#> C19-CB-0012_d9-Fresh_cond_Severe/Critical C19-CB-0198_d18-Fresh_cond_Severe/Critical
#> C19-CB-0021_d18-Fresh_cond_Severe/Critical C19-CB-0214_d7-Fresh_cond_Mild/Moderate
#> C19-CB-0008_d13-Fresh_cond_Severe/Critical BN-18_FreshEryLysis_12.05.2020-FreshEryLysis_cond_Severe/Critical
#> percentages_mildmoderate
#> C19-CB-0002_d13-Fresh_cond_Mild/Moderate 92.307692
#> C19-CB-0003_d18-Frozen_cond_Mild/Moderate 92.307692
#> C19-CB-0214_d7-Fresh_cond_Mild/Moderate 30.769231
#> BN-18_FreshEryLysis_12.05.2020-FreshEryLysis_cond_Severe/Critical 61.538462
#> C19-CB-0198_d18-Fresh_cond_Severe/Critical 0.000000
#> C19-CB-0009_d9-Fresh_cond_Severe/Critical 23.076923
#> C19-CB-0002_d8-Frozen_cond_Mild/Moderate 84.615385
#> C19-CB-0003_d13-Fresh_cond_Mild/Moderate 84.615385
#> C19-CB-0009_d16-Fresh_cond_Severe/Critical 7.692308
#> C19-CB-0012_d9-Fresh_cond_Severe/Critical 0.000000
#> C19-CB-0021_d18-Fresh_cond_Severe/Critical 7.692308
#> C19-CB-0008_d13-Fresh_cond_Severe/Critical 7.692308
Discussion:
Would ensemble learning using all feature types improve prediction performance compared to using the individual feature type ?
In this section we take a detailed look at the model performance. In particular, a number of factors may affect the classification performance, such as the choice of classification algorithm.
classifyR provides an implementation of a number of commonly used
classification algorithms, including:
- randomForest
- DLDA
- kNN
- GLM - elasticNetGLM - SVM - NSC - naiveBayes
- mixturesNormals
- CoxPH - CoxNet
- randomSurvivalForest
- XGB
classifyr_result = crossValidate(scfeatures_result,
outcome,
classifier = "naiveBayes",
nFolds = 2,
nRepeats = 5,
nCores = 5 )
Compare the accuracy obtained from the different algorithms.
format_accuracy <- function(result_list, performanceType = "Balanced Accuracy"){
result_list <- lapply(result_list, function(x) calcCVperformance(x, performanceType = performanceType ))
accuracy <- lapply(result_list, performance)
accuracy <- lapply(accuracy , `[`, performanceType)
accuracy <- data.frame( accuracy )
colnames(accuracy ) <- unlist( lapply ( strsplit( names(result_list) , "\\." ), `[` , 1))
accuracy <- reshape2::melt(accuracy)
return(accuracy)
}
classifyr_result <- readRDS("~/intermediate_result/classifyr_result_naivebayes.rds")
accuracy <- format_accuracy(classifyr_result )
accuracy_naivebayes_balanced_accuracy <- ggplot(accuracy , aes(x = variable, y = value )) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + xlab("Feature types") + ylab("Balanced Accuracy") + ylim(0,1)
accuracy_naivebayes_balanced_accuracy
Here we calculate accuracy instead of balanced accuracy. Inspect the difference in performance.
accuracy <- format_accuracy(classifyr_result , performanceType = "Accuracy")
accuracy_naivebayes_accuracy <- ggplot(accuracy , aes(x = variable, y = value)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + xlab("Feature types") + ylab("Accuracy")+ ylim(0,1)
accuracy_naivebayes_accuracy
Compare the performance on the same plot
ggpubr::ggarrange(plotlist = list(accuracy_knn_balanced_accuracy,
accuracy_naivebayes_balanced_accuracy,
accuracy_naivebayes_accuracy),
ncol=3)
Generalisable means whether the model can have good performance when tested on an independent dataset.
Here, we use the model built on the Schulte-Schrepping dataset to test on the Wilk dataset. The Wilk data is obtained from Wilk et al. 2022. We sampled 12 patients with mild/moderate and severe/critical conditions.
data <- readRDS("~/toy_data/wilk_full.rds")
First visualise the data.
data <- NormalizeData( data)
data <- FindVariableFeatures(data )
data <- ScaleData(data, features = rownames(data) )
data <- RunPCA(data)
data <- FindNeighbors(data, dims = 1:10)
data <- FindClusters(data, resolution = 0.5)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 112589
#> Number of edges: 3319297
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.9204
#> Number of communities: 19
#> Elapsed time: 105 seconds
data <- RunUMAP(data , dims = 1:10)
DimPlot(data, reduction = "umap", group.by = "meta_severity")
# Here we label the samples using the severity, such as it is easier later on to retrieve the severity outcomes of the samples.
data$sample <- paste0(data$sample, "_cond_" , data$meta_severity)
start <- Sys.time()
scfeatures_result_wilk <- scFeatures(data, ncores = 12)
finish <- Sys.time()
scfeatures_result_schulte <- scfeatures_result
Here we have provided the constructed scFeatures result for the Wilk dataset.
scfeatures_result_schulte <- readRDS("~/intermediate_result/scfeatures_result_schulte_12patients.rds")
scfeatures_result_wilk <- readRDS("~/intermediate_result/scfeatures_result_wilk_12patients.rds" )
scfeatures_result_wilk <- scfeatures_result_wilk[ names(scfeatures_result_schulte )]
# in order to train and test two different datasets, need to make sure the features are the same
# pick the common features
new_list <- lapply( 1:length(scfeatures_result_schulte), function(i){
common_features <- intersect(colnames( scfeatures_result_schulte[[i]]) ,
colnames( scfeatures_result_wilk[[i]]))
new_schulte <- scfeatures_result_schulte[[i]][, common_features]
new_wilk <- scfeatures_result_wilk[[i]][, common_features]
list(new_schulte, new_wilk)
})
# format the features
new_scfeatures_result_schulte <- lapply(new_list, `[[`, 1)
new_scfeatures_result_wilk <- lapply(new_list, `[[`, 2)
names(new_scfeatures_result_schulte) <- names(scfeatures_result_schulte)
names(new_scfeatures_result_wilk) <- names(scfeatures_result_wilk)
# obtain the outcome, which is stored in the rownames of ech matrix
outcome_schulte = scfeatures_result_schulte[[1]] %>% rownames %>% strsplit("_cond_") %>% sapply(function(x) x[2])
outcome_wilk = scfeatures_result_wilk[[1]] %>% rownames %>% strsplit("_cond_") %>% sapply(function(x) x[2])
names(outcome_wilk) <- rownames(scfeatures_result_wilk[[1]])
First, we use the dataset from Schulte-Schrepping to perform self cross-validation. The purpose is to identify the best model (the best parameters). Once we decide on the best model, we then use this model to test on the Wilk dataset.
# First clean the column names
for(i in 1:length(new_scfeatures_result_schulte)){
names(new_scfeatures_result_schulte[[i]]) <- gsub("\\s|[[:punct:]]", ".", names(new_scfeatures_result_schulte[[i]]))
}
# First clean the column names
for(i in 1:length(new_scfeatures_result_wilk)){
names(new_scfeatures_result_wilk[[i]]) <- gsub("\\s|[[:punct:]]", ".", names(new_scfeatures_result_wilk[[i]]))
}
# for each feature type, identify the best model and use the best model to predict on Wilk
result_generalisability <- NULL
for( i in c(1:length( new_scfeatures_result_schulte )) ){
model <- train(x = new_scfeatures_result_schulte[[i]] ,outcomeTrain = outcome_schulte,
performanceType = "Balanced Accuracy", classifier = "naiveBayes" )
prediction <- predict( model , new_scfeatures_result_wilk[[i]])
truth <- outcome_wilk[names(prediction)]
temp <- calcExternalPerformance( factor(truth) , prediction,
performanceType = c("Balanced Accuracy"))
temp <- data.frame(balanced_accuracy = temp,
featuretype = names(new_scfeatures_result_schulte)[[i]] )
result_generalisability <- rbind(result_generalisability, temp)
}
Visualise the accuracy using boxplots.
result_list_generalisability <- readRDS("~/intermediate_result/result_list_generalisability.rds")
result_list_generalisability$balanced_accuracy <- round( result_list_generalisability$balanced_accuracy, 2)
result_list_generalisability$featuretype <- factor( result_list_generalisability$featuretype , levels = c( result_list_generalisability$featuretype))
ggplot( result_list_generalisability, aes(x = featuretype , y = balanced_accuracy, fill = featuretype ) ) + geom_col()+ theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + ylim(0,1) + scale_fill_tableau(palette = "Tableau 20") +
geom_text(aes(label = balanced_accuracy), vjust = -0.5)
Discussion:
Examine the classification accuracy and comment on the generalisability of the model.
Final discussion:
How good do you think our model is? What parts of the workflow could you change to improve the results?